If you follow me, you know that this year I started a series called Weekly Digest for Data Science and AI: Python & R , where I highlighted the best libraries, repos, packages, and tools that help us be better data scientists for all kinds of tasks.

The great folks at Heartbeat sponsored a lot of these digests, and they asked me to create a list of the best of the best—those libraries that really changed or improved the way we worked this year (and beyond).

Top 7 for Python

7. AdaNet — Fast and flexible AutoML with learning guarantees.

AdaNet is a lightweight and scalable TensorFlow AutoML framework for training and deploying adaptive neural networks using the AdaNet algorithm [ Cortes et al. ICML 2017 ]. AdaNet combines several learned subnetworks in order to mitigate the complexity inherent in designing effective neural networks.

This package will help you selecting optimal neural network architectures, implementing an adaptive algorithm for learning a neural architecture as an ensemble of subnetworks.

You will need to know TensorFlow to use the package because it implements a TensorFlow Estimator, but this will help you simplify your machine learning programming by encapsulating training and also evaluation, prediction and export for serving.

You can build an ensemble of neural networks, and the library will help you optimize an objective that balances the trade-offs between the ensemble’s performance on the training set and its ability to generalize to unseen data.

Installation


adanet

depends on bug fixes and enhancements not present in TensorFlow releases prior to 1.7. You must install or upgrade your TensorFlow package to at least 1.7:

Installing from source

To install from source, you’ll first need to install


bazel

following their installation instructions .

Once you have verified that everything works well, install


adanet

as a pip package .

Usage

6. TPOT— An automated Python machine learning tool that optimizes machine learning pipelines using genetic programming.

https://github.com/EpistasisLab/tpot

Previously I talked about Auto-Keras, a great library for AutoML in the Pythonic world. Well, I have another very interesting tool for that.

The name is TPOT (Tree-based Pipeline Optimization Tool), and it’s an amazing library. It’s basically a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming .

TPOT can automate a lot of stuff life feature selection, model selection, feature construction, and much more. Luckily, if you’re a Python machine learner, TPOT is built on top of Scikit-learn, so all of the code it generates should look familiar.

What it does is automate the most tedious parts of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data, and then it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

This is how it works:

For more details you can read theses great article by Matthew Mayo :

Using AutoML to Generate Machine Learning Pipelines with TPOT
Thus far in this series of posts we have: This post will take a different approach to constructing pipelines. Certainly… www.kdnuggets.com

and Randy Olson :

TPOT: A Python Tool for Automating Data Science
By Randy Olson, University of Pennsylvania. Machine learning is often touted as: A field of study that gives computers… www.kdnuggets.com

Installation

You actually need to follow some instructions before installing TPOT. Here they are:

Installation — TPOT
Optionally, you can install XGBoost if you would like TPOT to use the eXtreme Gradient Boosting models. XGBoost is… epistasislab.github.io

After that you can just run:


pip install tpot

Examples:

First let’s start with the basic Iris dataset:

So here we built a very basic TPOT pipeline that will try to look for the best ML pipeline to predict the iris.target . And then we save that pipeline. After that, what we have to do is very simple — load the .py file you generated and you’ll see:

import  numpy  as  np

from  sklearn.kernel_approximation  import RBFSampler
 from  sklearn.model_selection  import train_test_split
 from  sklearn.pipeline  import make_pipeline
 from  sklearn.tree  import DecisionTreeClassifier


# NOTE: Make sure that the class is labeled 'class' in the data file


tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)

features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)

training_features, testing_features, training_classes, testing_classes = \

    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(

    RBFSampler(gamma=0.8500000000000001),

    DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=4, min_samples_split=9)

)

exported_pipeline.fit(training_features, training_classes)

results = exported_pipeline.predict(testing_features)

And that’s it. You built a classifier for the Iris dataset in a simple but powerful way.

Let’s go the MNIST dataset now:

As you can see, we did the same! Let’s load the .py file you generated again and you’ll see:


import
 
 numpy
 
 as
 
 np


from
 
 sklearn.model_selection
 
 import
 train_test_split

 
from
 
 sklearn.neighbors
 
 import
 KNeighborsClassifier


# NOTE: Make sure that the class is labeled 'class' in the data file


tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)

features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)

training_features, testing_features, training_classes, testing_classes = \

    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = KNeighborsClassifier(n_neighbors=4, p=2, weights="distance")

exported_pipeline.fit(training_features, training_classes)

results = exported_pipeline.predict(testing_features)

Super easy and fun. Check them out! Try it and please give them a star!

5. SHAP — A unified approach to explain the output of any machine learning model

Explaining machine learning models isn’t always easy. Yet it’s so important for a range of business applications. Luckily, there are some great libraries that help us with this task. In many applications, we need to know, understand, or prove how input variables are used in the model, and how they impact final model predictions.

SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations, uniting several previous methods and representing the only possible consistent and locally accurate additive feature attribution method based on expectations.

Installation

SHAP can be installed from PyPI


pip install shap

or conda-forge


conda install -c conda-forge shap

Usage

There are tons of different models and ways to use the package. Here, I’ll take one example from the DeepExplainer.

Deep SHAP is a high-speed approximation algorithm for SHAP values in deep learning models that builds on a connection with DeepLIFT , as described in the SHAP NIPS paper that you can read here:

[1802.03888] Consistent Individualized Feature Attribution for Tree Ensembles
Abstract: Interpreting predictions from tree ensemble methods such as gradient boosting machines and random forests is… arxiv.org

Here you can see how SHAP can be used to explain the result of a Keras model for the MNIST dataset:

You can find more examples here:

slundberg/shap
A unified approach to explain the output of any machine learning model. — slundberg/shap github.com

Take a look. You’ll be surprised :)

4. Optimus — 🚚 Agile Data Science Workflows made easy with Python and Spark.

Ok, so full disclosure, this library is like my baby. I’ve been working on it for a long time now, and I’m very happy to show you version 2.

Optimus V2 was created to make data cleaning a breeze. The API was designed to be super easy for newcomers and very familiar for people that come from working with pandas. Optimus expands the Spark DataFrame functionality, adding .rows and .cols attributes.

With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, and Keras.

It’s super easy to us. It’s like the evolution of pandas, with a piece of dplyr, joined by Keras and Spark. The code you create with Optimus will work on your local machine, and with a simple change of masters, it can run on your local cluster or in the cloud.

You will see a lot of interesting functions created to help with every step of the data science cycle.

Optimus is perfect as a companion for an agile methodology for data science because it can help you in almost all the steps of the process, and it can easily connect to other libraries and tools.

If you want to read more about an Agile DS Methodology check this out:

Agile Framework For Creating An ROI-Driven Data Science Practice
Data Science is an amazing field of research that is under active development both from the academia and the industry… www.business-science.io

Installation (pip):


pip install optimuspyspark

Usage:

As one example, you can load data from a url, transform it, and apply some predefined cleaning functions:

from optimus import Optimus

op = Optimus()

# This is a custom function

def func(value, arg):

    return "this was a number"

    

df =op.load.url("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/foo.csv")

df\

    .rows.sort("product","desc")\

    .cols.lower(["firstName","lastName"])\

    .cols.date_transform("birth", "new_date", "yyyy/MM/dd", "dd-MM-YYYY")\

    .cols.years_between("birth", "years_between", "yyyy/MM/dd")\

    .cols.remove_accents("lastName")\

    .cols.remove_special_chars("lastName")\

    .cols.replace("product","taaaccoo","taco")\

    .cols.replace("product",["piza","pizzza"],"pizza")\

    .rows.drop(df["id"]<7)\

    .cols.drop("dummyCol")\

    .cols.rename(str.lower)\

    .cols.apply_by_dtypes("product",func,"string", data_type="integer")\

    .cols.trim("*")\

    .show()

You can transform this:

into this:

Pretty cool, right?

You can do a thousand more things with the library, so please check it out:

Optimus — Data cleansing and exploration made simple
Prepare, process and explore your Big Data with the fastest open source library on the planet using Apache Spark and… www.hioptimus.com

3. spacy — Industrial-strength Natural Language Processing (NLP) with Python and Cython

From the creators :

spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it. It’s easy to install, and its API is simple and productive. We like to think of spaCy as the Ruby on Rails of Natural Language Processing.

spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, Scikit-learn, Gensim, and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.

Installation:

pip3 install spacy

$ python3 -m spacy download en

Here, we’re also downloading the English language model. You can find models for German, Spanish, Italian, Portuguese, French, and more here:

Models Overview · spaCy Models Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency… spacy.io

Here’s an example from the main webpage:

# python -m spacy download en_core_web_sm

import spacy

# Load English tokenizer, tagger, parser, NER and word vectors

nlp = spacy.load('en_core_web_sm')

# Process whole documents

text = (u"When Sebastian Thrun started working on self-driving cars at "

        u"Google in 2007, few people outside of the company took him "

        u"seriously. “I can tell you very senior CEOs of major American "

        u"car companies would shake my hand and turn away because I wasn’t "

        u"worth talking to,” said Thrun, now the co-founder and CEO of "

        u"online higher education startup Udacity, in an interview with "

        u"Recode earlier this week.")

    doc = nlp(text)

# Find named entities, phrases and concepts

for entity in doc.ents:

    print(entity.text, entity.label_)

# Determine semantic similarities

doc1 = nlp(u"my fries were super gross")

doc2 = nlp(u"such disgusting fries")

similarity = doc1.similarity(doc2)

print(doc1.text, doc2.text, similarity)

In this example, we first download the English tokenizer, tagger, parser, NER, and word vectors. Then we create some text, and finally we print the entities, phrases, and concepts found, and then we determine the semantic similarity of the two phrases. If you run this code you get this:

Sebastian Thrun PERSON

Google ORG

2007 DATE

American NORP

Thrun PERSON

Recode ORG

earlier this week DATE

my fries were super gross such disgusting fries 0.7139701635071919

Very simple and super useful. There is also a spaCy Universe, where you can find great resources developed with or for spaCy. It includes standalone packages, plugins, extensions, educational materials, operational utilities, and bindings for other languages:

Universe · spaCy
This section collects the many great resources developed with or for spaCy. It includes standalone packages, plugins… spacy.io

By the way, the usage page is great, with very good explanations and code:

Install spaCy · spaCy Usage Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency… spacy.io

Take a look at the visualizers page. Awesome features, here:

Visualizers · spaCy Usage Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency… spacy.io

2. jupytext — Jupyter notebooks as Markdown Documents, Julia, Python or R scripts

For me, this is one of the packages of the year. It’s such an important part of what we do as data scientists. Almost all of us work in notebooks like Jupyter, but we also use IDEs like PyCharm for more hardcore parts of our projects.

The good news is that plain scripts, which you can draft and test in your favorite IDE, open transparently as notebooks in Jupyter when using Jupytext. Run the notebook in Jupyter to generate the outputs, associate an .ipynb representation, and save and share your research as either a plain script or as a traditional Jupyter notebook with outputs.

You can see a workflow of what you can do with the package in the gif below:

Installation

Install Jupytext with:

pip install jupytext --upgrade

Then, configure Jupyter to use Jupytext:

generate a Jupyter config, if you don’t have one yet, with jupyter notebook --generate-config
edit .jupyter/jupyter_notebook_config.py and append the following:

c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"

and restart Jupyter, i.e. run:

jupyter notebook

You can give it a try here:

Binder (beta)
https://mybinder.org/v2/gh/mwouts/jupytext/master?filepath=demo mybinder.org

1. Chartify — Python library that makes it easy for data scientists to create charts.

https://xkcd.com/1945/

This, for me, is the winner of the year, for Python. If you are in the Python world, most likely you waste a lot of your time trying to create a decent plot. Luckily, we have libraries like Seaborn that make our life easier. But the issue is that their plots are not dynamic.

Then you have Bokeh—an amazing library—but creating interactive plots with it can be a pain in the a**. If you want to know more about Bokeh and interactive plots for Data Science, take a look at these great articles by William Koehrsen :

Data Visualization with Bokeh in Python, Part I: Getting Started
Elevate your visualization game towardsdatascience.com

Data Visualization with Bokeh in Python, Part II: Interactions
Moving beyond static plots towardsdatascience.com

Data Visualization with Bokeh in Python, Part III: Making a Complete Dashboard
Creating an interactive visualization application in Bokeh towardsdatascience.com

Chartify is built in top of Bokeh. But it’s also so much simpler.

From the authors:

Why use Chartify?

Consistent input data format: Spend less time transforming data to get your charts to work. All plotting functions use a consistent tidy input data format.
Smart default styles: Create pretty charts with very little customization required.
Simple API: We’ve attempted to make to the API as intuitive and easy to learn as possible.
Flexibility: Chartify is built on top of Bokeh , so if you do need more control you can always fall back on Bokeh’s API.

Installation

Chartify can be installed via pip:


pip3 install chartify

2. Install chromedriver requirement (Optional. Needed for PNG output):

Install Google Chrome.
Download the appropriate version of chromedriver for your OS here .
Copy the executable file to a directory within your PATH.
View directories in your PATH variable: echo $PATH
Copy chromedriver to the appropriate directory, e.g.: cp chromedriver /usr/local/bin

Usage

Let’s say we want to create this chart:

import pandas as pd

import chartify

# Generate example data

data = chartify.examples.example_data()

Now that we have some example data loaded let’s do some transformations:

total_quantity_by_month_and_fruit = (data.groupby(

        [data['date'] + pd.offsets.MonthBegin(-1), 'fruit'])['quantity'].sum()

        .reset_index().rename(columns={'date': 'month'})

        .sort_values('month'))

    print(total_quantity_by_month_and_fruit.head())

month          fruit     quantity

0 2017-01-01   Apple         7

1 2017-01-01  Banana         6

2 2017-01-01   Grape         1

3 2017-01-01  Orange         2

4 2017-02-01   Apple         8

And now we can plot it:

# Plot the data

ch = chartify.Chart(blank_labels=True, x_axis_type='datetime')

ch.set_title("Stacked area")

ch.set_subtitle("Represent changes in distribution.")

ch.plot.area(

        data_frame=total_quantity_by_month_and_fruit,

        x_column='month',

        y_column='quantity',

        color_column='fruit',

        stacked=True)

    ch.show('png')

Super easy to create a plot, and it’s interactive. If you want more examples to create stuff like this:

And more, check the original repo:

spotify/chartify
Python library that makes it easy for data scientists to create charts. — spotify/chartify github.com

Top 7 for R

7. infer — An R package for tidyverse-friendly statistical inference

https://github.com/tidymodels/infer

Inference, or statistical inference, is the process of using data analysis to deduce properties of an underlying probability distribution.

The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework.

Installation

To install the current stable version of infer from CRAN:

install.packages("infer")

Usage

Let’s try a simple example on the mtcars dataset to see what the library can do for us.

First, let’s overwrite mtcars so that the variables cyl , vs , am , gear , and carb are factor s.


library(infer)

library(dplyr)

mtcars <- mtcars %>%

  mutate(cyl = factor(cyl),

         vs = factor(vs),

         am = factor(am),

         gear = factor(gear),

         carb = factor(carb))

         # For reproducibility         

set.seed(2018)

We’ll try hypothesis testing. Here, a hypothesis is proposed so that it’s testable on the basis of observing a process that’s modeled via a set of random variables. Normally, two statistical data sets are compared, or a data set obtained by sampling is compared against a synthetic data set from an idealized model.


mtcars %>%

  
  specify
(response = mpg) %>% # formula alt: mpg ~ NULL

  
  hypothesize
(null = "point", med = 26) %>% 

  
  generate
(reps = 100, type = "bootstrap") %>% 

  
  calculate
(stat = "median")

Here, we first specify the response and explanatory variables, then we declare a null hypothesis. After that, we generate resamples using bootstrap and finally calculate the median. The result of that code is:


## # A tibble: 100 x 2

##    replicate  stat

##        <int> <dbl>

##  1         1  26.6

##  2         2  25.1

##  3         3  25.2

##  4         4  24.7

##  5         5  24.6

##  6         6  25.8

##  7         7  24.7

##  8         8  25.6

##  9         9  25.0

## 10        10  25.1

## # ... with 90 more rows

One of the greatest parts of this library is the visualize function. This will allow you to visualize the distribution of the simulation-based inferential statistics or the theoretical distribution (or both). For an example, let’s use the flights data set. First, let’s do some data preparation:


library(nycflights13)
library(dplyr)
library(ggplot2)
library(stringr)
library(infer)
set.seed(2017)
fli_small <- flights %>% 
  na.omit() %>% 
  
  sample_n
(size = 500) %>% 
  
  mutate
(season = 
case_when
(
    month %in% c(10:12, 1:3) ~ "winter",
    month %in% c(4:9) ~ "summer"
  )) %>% 
  
  mutate
(day_hour = 
case_when
(
    
    between
(hour, 1, 12) ~ "morning",
    
    between
(hour, 13, 24) ~ "not morning"
  )) %>% 
  
  select
(arr_delay, dep_delay, season, 
         day_hour, origin, carrier)

And now we can run a randomization approach to χ2-statistic :


chisq_null_distn <- fli_small %>%

  
  specify
(origin ~ season) %>% # alt: response = origin, explanatory = season

  
  hypothesize
(null = "independence") %>%

  
  generate
(reps = 1000, type = "permute") %>%

  
  calculate
(stat = "Chisq")

chisq_null_distn %>% 
visualize
(obs_stat = obs_chisq, direction = "greater")

or see the theoretical distribution:fli_small %>% specify (origin ~ season) %>% hypothesize (null = "independence") %>% # generate() ## Not used for theoretical calculate (stat = "Chisq") %>% visualize (method = "theoretical", obs_stat = obs_chisq, direction = "right")

For more on this package visit: Tidy Statistical Inference
The objective of this package is to perform inference using an expressive statistical grammar that coheres with the… infer.netlify.com

6. janitor — simple tools for data cleaning in R

https://github.com/sfirke/janitor

Data cleansing is a topic very close to me. I’ve been working with my team at Iron-AI to create a tool for Python called Optimus. You can see more about it here:

Data cleansing and exploration with Python and Apache Spark — Big Data and Data Science — Optimus
The group of BBVA Data & Analytics in Mexico has been using Optimus for the past months and we have boosted our… hioptimus.com

But this tool I’m showing you is a very cool package with simple functions for data cleaning.

It has three main functions:

perfectly format data.frame column names;
create and format frequency tables of one, two, or three variables (think an improved table() ; and
isolate partially-duplicate records.

Oh, and it’s a tidyverse -oriented package. Specifically, it works nicely with the %>% pipe and is optimized for cleaning data brought in with the readr and readxl packages.

Installation

install.packages("janitor")

Usage

I’m using the example from the repo, and the data dirty_data.xlsx .

library(pacman) # for loading packages

p_load(readxl, janitor, dplyr, here)

roster_raw <- read_excel(here("dirty_data.xlsx")) # available at 
http://github.com/sfirke/janitor


glimpse(roster_raw)

#> Observations: 13

#> Variables: 11

#> $ `First Name`        <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", N...

#> $ `Last Name`         <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lam...

#> $ `Employee Status`   <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Tea...

#> $ Subject             <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English",...

#> $ `Hire Date`         <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 347...

#> $ `% Allocated`       <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80

#> $ `Full time?`        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", ...

#> $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA

#> $ Certification       <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science ...

#> $ Certification__1    <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", N...

#> $ Certification__2    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA

With this:

roster <- roster_raw %>%

  clean_names() %>%

  remove_empty(c("rows", "cols")) %>%

  mutate(hire_date = excel_numeric_to_date(hire_date),

         cert = coalesce(certification, certification_1)) %>% # from dplyr

  select(-certification, -certification_1) # drop unwanted columns

With the clean_names() function, we’re telling R that we’re about to use janitor. Then we clean the empty rows and columns, and then using dplyr we change the format for the dates, create a new column with the information of certification and certification_1 , and then drop them.

And with this piece of code…

roster %>% get_dupes(first_name, last_name)

we can find duplicated records that have the same name and last name.

The package also introduces the tabyl function that tabulates the data, like table but pipe-able, data.frame-based, and fully featured. For example:

roster %>%

  tabyl(subject)

  #>     subject n    percent valid_percent

#>  Basketball 1 0.08333333           0.1

#>   Chemistry 1 0.08333333           0.1

#>        Dean 1 0.08333333           0.1

#>    Drafting 1 0.08333333           0.1

#>     English 2 0.16666667           0.2

#>       Music 1 0.08333333           0.1

#>          PE 1 0.08333333           0.1

#>     Physics 1 0.08333333           0.1

#>     Science 1 0.08333333           0.1

#>        <NA> 2 0.16666667            NA

You can do a lot more things with the package, so visit their site and give them some love :)

5. Esquisse — RStudio add-in to make plots with ggplot2

https://github.com/dreamRs/esquisse

This add-in allows you to interactively explore your data by visualizing it with the ggplot2 package. It allows you to draw bar graphs, curves, scatter plots, and histograms, and then export the graph or retrieve the code generating the graph.

Installation

Install from CRAN with :

# From CRAN

install.packages("esquisse")

Usage

Then launch the add-in via the RStudio menu. If you don’t have data.frame in your environment, datasets in ggplot2 are used.

ggplot2 builder addin

Launch the add-in via the RStudio menu or with:

esquisse::esquisser()

The first step is to choose a data.frame :

Or you can use a dataset directly with:

esquisse::esquisser(data = iris)

After that, you can drag and drop variables to create a plot:

You can find information about the package and sub-menus in the original repo:

dreamRs/esquisse
RStudio add-in to make plots with ggplot2. Contribute to dreamRs/esquisse development by creating an account on GitHub. github.com

4. DataExplorer — Automate data exploration and treatment

https://github.com/boxuancui/DataExplorer

Exploratory Data Analysis (EDA) is an initial and important phase of data analysis/predictive modeling. During this process, analysts/modelers will have a first look of the data, and thus generate relevant hypotheses and decide next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of data handling and visualization, so that users could focus on studying the data and extracting insights.

Installation

The package can be installed directly from CRAN.

install.packages("DataExplorer")

Usage

With the package you can create reports, plots, and tables like this:


## Plot basic description for airquality data


plot_intro
(airquality)


## View missing value distribution for airquality data


plot_missing
(airquality)


## Left: frequency distribution of all discrete variables


plot_bar
(diamonds)

## Right: `price` distribution of all discrete variables


plot_bar
(diamonds, with = "price")


## View histogram of all continuous variables


plot_histogram
(diamonds)

You can find much more like this on the package’s official webpage:

Automate data exploration and treatment
Automated data exploration process for analytical tasks and predictive modeling, so that users could focus on… boxuancui.github.io

And in this vignette:

Introduction to DataExplorer
This document introduces the package DataExplorer, and shows how it can help you with different tasks throughout your… boxuancui.github.io

3. Sparklyr — R interface for Apache Spark

https://github.com/rstudio/sparklyr

Sparklyr will allow you to:

Connect to Spark from R. The sparklyr package provides a
complete dplyr backend.
Filter and aggregate Spark datasets, and then bring them into R for
analysis and visualization.
Use Spark’s distributed machine learning library from R.
Create extensions that call the full Spark API and provide
interfaces to Spark packages.

Installation

You can install the Sparklyr package from CRAN as follows:


install.packages("sparklyr")

You should also install a local version of Spark for development purposes:



library
(sparklyr)

spark_install(version = "2.3.1")

Usage

The first part of using Spark is always creating a context and connecting to a local or remote cluster.

Here we’ll connect to a local instance of Spark via the spark_connect function:

library(sparklyr)

sc <- spark_connect(master = "local")

Using sparklyr with dplyr and ggplot2

We’ll start by copying some datasets from R into the Spark cluster (note that you may need to install the nycflights13 and Lahman packages in order to execute this code):

install.packages(c("nycflights13", "Lahman"))

library(dplyr)

iris_tbl <- copy_to(sc, iris)

flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

batting_tbl <- copy_to(sc, Lahman::Batting, "batting")

src_tbls(sc)


## [1] "batting" "flights" "iris"

To start with, here’s a simple filtering example:

# filter by departure delay and print the first few records

flights_tbl %>% filter(dep_delay == 2)


## # Source:   lazy query [?? x 19]

## # Database: spark_connection

##     year month   day dep_time sched_dep_time dep_delay arr_time

##    <int> <int> <int>    <int>          <int>     <dbl>    <int>

##  1  2013     1     1      517            515         2      830

##  2  2013     1     1      542            540         2      923

##  3  2013     1     1      702            700         2     1058

##  4  2013     1     1      715            713         2      911

##  5  2013     1     1      752            750         2     1025

##  6  2013     1     1      917            915         2     1206

##  7  2013     1     1      932            930         2     1219

##  8  2013     1     1     1028           1026         2     1350

##  9  2013     1     1     1042           1040         2     1325

## 10  2013     1     1     1231           1229         2     1523

## # ... with more rows, and 12 more variables: sched_arr_time <int>,

## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,

## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,

## #   minute <dbl>, time_hour <dttm>

Let’s plot the data on flight delays:

delay <- flights_tbl %>% 

  group_by(tailnum) %>%

  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%

  filter(count > 20, dist < 2000, !is.na(delay)) %>%

  collect

# plot delays

library(ggplot2)

ggplot(delay, aes(dist, delay)) +

  geom_point(aes(size = count), alpha = 1/2) +

  geom_smooth() +

  scale_size_area(max_size = 2)


## `geom_smooth()` using method = 'gam'

Machine Learning with Sparklyr

You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within Sparklyr. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.

Here’s an example where we use ml_linear_regression to fit a linear regression model. We’ll use the built-in mtcars dataset to see if we can predict a car’s fuel consumption (mpg ) based on its weight (wt ), and the number of cylinders the engine contains (cyl ). We’ll assume in each case that the relationship between mpg and each of our features is linear.



# copy mtcars into spark


mtcars_tbl <- copy_to(sc, mtcars)



# transform our data set, and then partition into 'training', 'test'


partitions <- mtcars_tbl %>%

  filter(hp >= 100) %>%

  mutate(cyl8 = cyl == 8) %>%

  sdf_partition(training = 0.5, test = 0.5, seed = 1099)



# fit a linear model to the training dataset


fit <- partitions$training %>%

  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))

fit


## Call: ml_linear_regression.tbl_spark(., response = "mpg", features = c("wt", "cyl"))  

## 

## Formula: mpg ~ wt + cyl

## 

## Coefficients:

## (Intercept)          wt         cyl 

##   33.499452   -2.818463   -0.923187

For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit and the statistical significance of each of our predictors.


summary(fit)


## Call: ml_linear_regression.tbl_spark(., response = "mpg", features = c("wt", "cyl"))  

## 

## Deviance Residuals:

##    Min     1Q Median     3Q    Max 

## -1.752 -1.134 -0.499  1.296  2.282 

## 

## Coefficients:

## (Intercept)          wt         cyl 

##   33.499452   -2.818463   -0.923187 

## 

## R-Squared: 0.8274

## Root Mean Squared Error: 1.422

Spark machine learning supports a wide array of algorithms and feature transformations, and as illustrated above, it’s easy to chain these functions together with dplyr pipelines.

Check out more about machine learning with sparklyr here:

sparklyr
An R interface to Spark spark.rstudio.com

And more information in general about the package and examples here:

sparklyr
An R interface to Spark spark.rstudio.com

2. Drake — An R-focused pipeline toolkit for reproducibility and high-performance computing

Nope, just kidding. But the name of the package is drake !

This is such an amazing package. I’ll create a separate post with more details about it, so wait for that!

Drake is a package created as a general-purpose workflow manager for data-driven tasks. It rebuilds intermediate data objects when their dependencies change, and it skips work when the results are already up to date.

Also, not every run-through starts from scratch, and completed workflows have tangible evidence of reproducibility.

Reproducibility, good management, and tracking experiments are all necessary for easily testing others’ work and analysis. It’s a huge deal in Data Science, and you can read more about it here:

From Zach Scott :

Data Science’s Reproducibility Crisis
What is Reproducibility in Data Science and Why Should We Care? towardsdatascience.com

Toward Reproducibility: Balancing Privacy and Publication
Can there ever be a Goldilocks option in the conflict between data security and research disclosure? towardsdatascience.com

And in an article by me :)

Manage your Machine Learning Lifecycle with MLflow — Part 1.
Reproducibility, good management and tracking experiments is necessary for making easy to test other’s work and… towardsdatascience.com

With drake , you can automatically

Launch the parts that changed since last time.
Skip the rest.

Installation



# Install the latest stable release from CRAN.



install.packages
("drake")



# Alternatively, install the development version from GitHub.



install.packages
("devtools")


library
(devtools)


install_github
("ropensci/drake")

There are some known errors when installing from CRAN. For more on these errors, visit:

The drake R Package User Manual
The drake R Package User Manualropenscilabs.github.io

I encountered a mistake, so I recommend that for now you install the package from GitHub.

Ok, so let’s reproduce a simple example with a twist:

I added a simple plot to see the linear model within drake ’s main example. With this code, you’re creating a plan for executing your whole project.

First, we read the data. Then we prepare it for analysis, create a simple hist, calculate the correlation, fit the model, plot the linear model, and finally create a rmarkdown report.

The code I used for the final report is here:

If we change some of our functions or analysis, when we execute the plan, drake will know what has changed and will only run those changes. It creates a graph so you can see what’s happening:

In Rstudio, this graph is interactive, and you can save it to HTML for later analysis.

There are more awesome things that you can do with drake that I’ll show in a future post :)

1. DALEX — Descriptive mAchine Learning EXplanations

Explaining machine learning models isn’t always easy. Yet it’s so important for a range of business applications. Luckily, there are some great libraries that help us with this task. For example:

thomasp85/lime
lime — Local Interpretable Model-Agnostic Explanations (R port of original Python package) github.com

(By the way, sometimes a simple visualization with ggplot can help you explain a model. For more on this check the awesome article below by Matthew Mayo )

Interpreting Machine Learning Models: An Overview
An article on machine learning interpretation appeared on O’Reilly’s blog back in March, written by Patrick Hall, Wen… www.kdnuggets.com

In many applications, we need to know, understand, or prove how input variables are used in the model, and how they impact final model predictions.DALEX is a set of tools that helps explain how complex models are working.

To install from CRAN, just run: